For the sake of simplicity, I choose only the features "sqft_living", "sqft_lot", "view", "condition" and "grade" (they should carry qualitatively different, largely uncorrelated information) and run a fairly robust ML algorithm, random forest, with default parameters to predict the price.
Afterwards, I evaluate it using easily interpretable scores (such as explained variance or mean absolute percentage error) based on the cross-validation performance of the trained model.
The obtained scores seem surprisingly good, maybe even too good? That would warrant further inspection.
import pandas as pd
import plotly.express as px
df = pd.read_csv("kc_house_data.csv")
print("number of rows:", df.shape[0])
df.sample(3)
# Here I assume that there are no missing values; normally I would also check
# whether any 'magic values' (e.g. 0 or -999 sentinels) serve as nulls.
df.isnull().sum(axis=0)
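The 'magic value' check mentioned above could be sketched roughly as follows. The sentinel candidates and the `count_magic_values` helper are assumptions for illustration, not values known to occur in this dataset:

```python
import pandas as pd

# Hypothetical sketch of the 'magic value' check: the sentinel candidates
# below are assumptions, not values known to occur in this dataset.
SENTINELS = [0, -1, -999, 9999]

def count_magic_values(df: pd.DataFrame, sentinels=SENTINELS) -> pd.DataFrame:
    """Count how often each candidate sentinel appears in each numeric column."""
    numeric = df.select_dtypes("number")
    return pd.DataFrame({s: numeric.eq(s).sum() for s in sentinels})

# Toy example: a zero in 'sqft_basement' may legitimately mean "no basement",
# so a nonzero count is a prompt for inspection, not proof of missingness.
toy = pd.DataFrame({"sqft_basement": [0, 400, 0],
                    "price": [221900, 538000, 180000]})
print(count_magic_values(toy))
```

A nonzero count only flags a column for a closer look; domain knowledge decides whether the value is a genuine measurement or a disguised null.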
df.describe()
COLS = ["price", "sqft_living", "sqft_lot", "view", "condition", "grade"]
px.scatter_matrix(df[COLS], height=1000)
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
CV_K = 3
X, y = df[COLS], df["price"]
cross_val_score(RandomForestRegressor(random_state=0),
                X, y, cv=CV_K, scoring='explained_variance')
cross_val_score(RandomForestRegressor(random_state=0),
                X, y, cv=CV_K, scoring='neg_mean_absolute_percentage_error')
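One cheap sanity check for "too good" scores is a trivial baseline: a model that always predicts the training mean has explained variance 0 by construction (shifting predictions by a constant does not change the residual variance), so the forest's scores carry meaning only insofar as they clearly beat it. A self-contained sketch on synthetic data; the real check would reuse X, y and CV_K from above:

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for (X, y); the comparison pattern is what matters here.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 5))
y_demo = X_demo @ np.array([3.0, 1.0, 0.5, 0.0, 0.0]) + rng.normal(scale=0.1, size=300)

# Baseline: always predict the training mean -> explained variance ~ 0.
baseline = cross_val_score(DummyRegressor(strategy="mean"),
                           X_demo, y_demo, cv=3, scoring="explained_variance")
model = cross_val_score(RandomForestRegressor(random_state=0),
                        X_demo, y_demo, cv=3, scoring="explained_variance")
print("baseline:", baseline.mean(), "model:", model.mean())
```

A large gap over the baseline does not rule out target leakage or train/test contamination, so "too good" scores would still need an inspection of the features themselves.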